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Abstract 

Wc present a case-study on the utility of graphics cards to perform massively parallel sim- 
ulation of advanced Monte Carlo methods. Graphics cards, containing multiple Graphics Pro- 
cessing Units (GPUs), are self-contained parallel computational devices that can be housed 
in conventional desktop and laptop computers. For certain classes of Monte Carlo algorithms 
they offer massively parallel simulation, with the added advantage over conventional distributed 
multi-core processors that they are cheap, easily accessible, easy to maintain, easy to code, 
dedicated local devices with low power consumption. On a canonical set of stochastic simula- 
tion examples including population-based Markov chain Monte Carlo methods and Sequential 
Monte Carlo methods, we find speedups from 35 to 500 fold over conventional single-threaded 
computer code. Our findings suggest that GPUs have the potential to facilitate the growth 
of statistical modelling into complex data rich domains through the availability of cheap and 
accessible many-core computation. We believe the speedup we observe should motivate wider 
use of parallelizable simulation methods and greater methodological attention to their design. 
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1 Introduction 



We describe a case-study in the utility of graphics cards involving Graphics Processing Units (GPUs) 
to perform local, dedicated, massively parallel stochastic simulation. GPUs were originally devel- 
oped as dedicated devices to aid in real-time graphics rendering. However recently there has been 
an emerging lit erature on their us e for scientific comput i ng as they house multi-core processors. Ex- 
amples include Stone et al. ( 2007 ) and Friedrichs et al. ( 20091 ) . which discuss their use in molecular 
modelling and dynamics. 



To gain an understanding of the potential benefits to statisticians we have investigated speedups on 
a canonical set of examples taken from the advanced Monte Carlo literature. These include Bayesian 
inference for a Gaussian mixture model computed using a population-based Markov chain Monte 
Carlo (MCMC) method and a sequential Monte Carlo (SMC) sampler and sequential Bayesian 
inference for a multivariate stochastic volatility model implemented using a standard SMC method, 
also known as a particle filter in this context. In these examples we report substantial speedups 
from the use of GPUs over conventional CPUs. 



The potential o f para llel processing to aid in statistical computing is well documented (see e.g. 
Kontoghiorghes ( 20061 )). However, previous studies have relied on distributed multi-core clusters 



of CPUs for implementation. In contrast, graphics cards for certain generic types of computation 
offer parallel processing speedups with advantages on a number of fronts, including: 



• Cost: graphics cards are relatively cheap. At time of writing the cards used in our study 
retail at around 200 US dollars. 

• Accessibility: graphics cards are readily obtainable from high street computer stores or over 
the internet. 

• Maintenance: the devices are self-contained and can be hosted on conventional desktop and 
laptop computers. 

• Speed: in line with multi-core CPU clusters, graphics cards offer significant speedup, albeit 
for a restricted class of scientific computing algorithms. 

• Power: GPUs are low energy consumption devices compared to clusters of traditional com- 
puters, with a graphics card requiring around 200 Watts. While improvements in energy 
efficiency are application-specific, it is reasonable in many situations to expect a GPU to use 
around 10 per cent of the energy to that of an equivalent CPU cluster. 

• Dedicated and local: the graphics cards slot into conventional computers offering the user 
ownership without the need to transport data externally. 



The idea of splitting the computational effort of parallelizable algorithms amongst processors is 
certainly not new to statisticians. In fact, distributed systems and clusters of computers have been 
around for decades. Previous wor k on paralle l izatio n of MCMC me t hods on a group o f netwo rked 
computers include, among others, Rosenthal ( 2000l ) and Brockwell ( 20061 ). Rosenthal ( 2000l ) dis- 
cusses how to deal with computers running at different speeds and potential computer failure while 
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Brockwelll tood ) discusses the parallel implementation of a standard single chain MCMC algo- 



rithm by pre-computing acceptance ratios. The latency and bandwidth of communication in these 
systems make them suitable only in cases where communication between streams of computation, 
or threads, is infrequent and low in volume. In other words, while many algorithms involve com- 
putation that could theoretically be distributed amongst processors, the overhead associated with 
distributing the work erases any speedup. In contrast, many-core processor communication has 
very low latency and very high bandwidth due to high-speed memory that is shared amongst the 
cores. Low latency here means the time for a unit of data to be accessed or written to memory 
by a processor is low whilst high bandwidth means that the amount of data that can be sent in 
a unit of time is high. For many algorithms, this makes parallelization viable where it previously 
was not. In addition, the energy efficiency of a many-core computation compared to a single-core 
or distributed computation can be improved. This is because the computation can both take less 
time and require less overhead. 

We choose to investigate the speed up for the simulation of random variates from complex dis- 
tributions, a comm o n com putational task when performing inference using Monte Carlo (see e.g. 
Robert and Casellal tooj )). In particular, we focus on population-based MCMC methods and SMC 



methods for producing random variates as these are not algorithms that typically see significant 
speedup on clusters due to the need for frequent, high-volume communication between computing 
nodes. For the examples we consider, we find that computation time can be significantly lowered 
for all applications, and drastically lowered in some cases. This means that we can obtain the 
samples we want in seconds instead of hours and minutes instead of days. The types of speedup 
observed are dependent on the ability of the methods to be parallelized. In particular, speedup 
increases with the number of auxiliary distributions in population-based MCMC and the number 
of particles in SMC until some device-specific capacity is reached. 

The algorithms are implemented for the Compute Unified Device Architecture (CUDA) and make 
use of CPUs which support this architecture. CUDA offers a fairly mature development environ- 
ment via an extension to the C programming language. We estimate that a programmer proficient 
in C should be able to code effectively in CUDA within a few weeks of dedicated study. For our 
apphcations we use CUDA version 2.1 with an NVIDIA GTX 280 as weh as an NVIDIA 8800 GT. 
The GTX 280 has 30 multiprocessors while the 8800 GT has 14 multiprocessors. For all current 
NVIDIA cards, a multiprocessor comprises 8 arithmetic logic units (ALUs), 2 special units for tran- 
scendental functions, a multithreaded instruction unit and on-chip shared memory. In addition to 
having more multiprocessors than the 8800 GT, each GTX 280 multiprocessor itself has more reg- 
isters, can support more active threads and includes one double-precision ALU. For single-precision 
floating point computation, one can think of the GTX 280 as having 240 (30 x 8) and the 8800 GT 
as having 112 (14 x 8) single processors respectively. At present, the retail price of the GTX 280 is 
just over double that of the 8800GT and it requires just over twice the power. 



2 GPUs for Parallel Processing 



GPUs have evolved into many-core processing units, currently with up to 30 multiprocessors per 
card, in response to commercial demand for real-time graphics rendering, independently of demand 
for many-core processors in the scientific computing community. As such, the architecture of 
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Figure 1: Link between host and graphics card. The thicker hnes represent higher data bandwidth 
while the squares represent processor cores. 

GPUs is very different to that of conventional central processing units (CPUs). An important 
difference is that GPUs devote proportionally more transistors to ALUs and less to caches and 
flow control in comparison to CPUs. This makes them less general purpose but highly effective 
for data-parallel computation with high arithmetic intensity, i.e. computations where the same 
instructions are executed on different data elements and where the ratio of arithmetic operations 
to memory operations is high. This single instruction, multiple data (SIMD) architecture puts a 
heavy restriction on the types of computation that optimally utilize the GPU but in cases where 
the architecture is suitable it reduces overhead. 

Figured] gives a visualization of the link between a host machine and the graphics card, emphasizing 
the data bandwidth characteristics of the links and the number of processing cores. A program 
utilizing a GPU is hosted on a CPU with both the CPU and the GPU having their own memory. 
Data is passed between the host and the device via a standard memory bus, similar to how data is 
passed between main memory and the CPU. The memory bus between GPU memory and the GPU 
cores is both wider and has a higher clock rate than a standard bus, enabling much more data to 
be sent to the cores than the equivalent link the host. This type of architecture is ideally suited to 
data-parallel computation since large quantities of data can be loaded into registers for the cores 
to process in parallel. In contrast, typical computer architectures use a cache to speed up memory 
accesses using locality principles that are generally good but do not fully apply to data-parallel 
computations, with the absence of temporal locality most notable. 

2.1 Programming with Graphics Cards 

CUDA provides the interface to compliant GPUs by extending the C programming language. Pro- 
grams compiled with CUDA allow computation to be split between the CPU and the GPU. In this 
sense, the GPU can be treated as an additional, specialized processor for data-parallel computation. 
In the following text, host code refers to code that is executed on the CPU whilst device code is 
code that is executed on the GPU. We present a simple example in Figures [2][H explained below, 
that computes a classical importance sampling estimate (see Section [3]). In the code snippets, 
keywords in the C language are in bold face whilst CUDA keywords are both bold and italicized. 
A line beginning with a "//" is a comment and is ignored by the compiler. 
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global void importance_sample (int N, float* d_array, float* d_array_out) { 

// thread id = threads per block * block id + thread id within block 

const int tid = blockDim.x * blockldx.x + threadldx.x; 

// total number of threads = threads per block * number of blocks 

const int tt = blockDim.x * gridDim.x; 

int i ; 

float w, x; 

for (i = tid; i < N; i += tt) { 

X = d_array [ i ] ; 

w = target_pdf (x) / proposal_pdf (x) ; 
d_array_out [ i ] = phi (x) * w; 

} 

} 



Figure 2: Kernel that evaluates an importance weight and test function 


} 


_ device 

return 

+ 


float target_pdf (float x) { 

l.Of / sqrtf(2 * PI) * exp(-(x - 1.5) * (x - 1.5) / 0.5f) 
l.Of / sqrtf(2 * PI) * exp(-(x +1) * (x + 1) / 0.5f); 


} 


_ device 

return 


float proposal_pdf (float x) { 

l.Of / sqrtf(2 * PI) * exp(-x * x / 2. Of); 


} 


.device 

return 


float phi (float x) { 

X * x; 



Figure 3: Device functions for evaluating the target density, the proposal density and the test 
function. The target is an equally weighted, two-component mixture of normals with equal variances 
of 0.25 and means at -1 and 1.5 while the proposal is a standard normal distribution. The test 
function squares its input so that the integral that is estimated is the expectation of the second 
moment of a random variable distributed according to the target density. 

CUDA allows users to define special functions, called kernels, that are called by the host code to be 
executed in parallel on the GPU by a collection of threads. Figure [2] shows an example of a kernel 
function, which can be invoked in host code using the syntax 

importance_sample<<<nb, nt>>> (N, d_array, d_array_out ) ; 

where nb is the number of blocks of threads and nt is the number of threads per block. The total 
number of threads created by this call is the product of nb and nt and one can think of a threads as 
being a single stream of computation. For most kernels, the numbers of threads and blocks can be 
changed to tune performance on different cards or with different data. A more detailed description 
of blocks and threads and their relation to the hardware is given in Section 12. 2[ 

A kernel is defined with the __global_ qualifier. Kernels are special in that they are always invoked 
in parallel with the numbers of blocks and threads specified and have a void return type. As such, 
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int N = 16777216; 

float h_sum, result; 
float* d_array; 
float* d_array_out; 

float* array = (float*) inalloc(N * sizeof (float) ) ; 
cudaMalloc { {-void **) &d_array, N * sizeof (float) ) ; 
cudaMalloc { (void **) &d_array_out, N * sizeof (float) ) ; 

populate_randn (array, N) ; 

cudaMemcpy (d_array, array, N * sizeof (float) , cudaMemcpyHostToDevice) ; 

is«<64, 128»>(N, d_array, d_array_out) ; 
h_sum = reduce (N, d_array_out) ; 
result = h_sum / N; 

free (array) ; 
cudaFree ( d_a r r a y ) ; 
cudaFree (d_array_out) ; 



Figure 4: Host code 



program correctness depends only on how the threads invoked in the kernel call modify memory on 
the graphics card. In particular, code must be written that guarantees the correct modifications 
to memory once all threads have completed, especially when threads interact during execution by 
reading and writing from shared memory locations. In Figure [21 a kernel is defined that takes as 
input an array of random values sampled from a proposal distribution and places, for each value, 
the product of the test function and the importance weight at that value in a separate array. One 
can see that each thread is responsible for N/tt values, assuming N is a multiple of tt. Within 
a kernel, special functions can be called that have been defined with the __device__ qualifier. 
These functions can only be called by __global_ functions or __device_ functions themselves. 
In Figure [21 target_pdf , proposal_pdf and phi are examples of this, and their definitions are 
provided in Figure [3l In this particular kernel we see that each thread first computes its absolute 
thread identifier tid and the total number of threads tt. It then computes an importance weight 
and evaluates the test function for each value in d_array it is responsible for and stores the result 
in d_array_out. Since there is no thread interaction in this example kernel, it is reasonably 
straightforward to verify its correctness. 

Figure [H gives a snippet of code that is run on the host and completes our example. First, memory 
is allocated on both the host and the graphics card using the ma Hoc and cudaMalloc functions 
respectively. The host function populate_randn then puts N standard normal random variates 
in array. These values are copied into the GPU array, d_array, via the cudaMemcpy function. 
In Figure [H this is a transfer along the memory bus that connects host and graphics card memory. 
At this point, the kernel is called with 64 blocks of 128 threads per block. The reduce function 
is a CPU function that returns the sum of the elements in a GPU array. Of course, this function 
can itself invoke a GPU kernel. Finally, the importance sampling estimate is obtained by dividing 
this sum by N and memory is freed. Note that this code has been written so as to expose the 
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most common functions that are used in GPU programming using CUDA. For example, it would 
be faster to create the random variates on the GPU itself but this would not have allowed any 
memory transfer operations to be shown here. 

This basic example highlights the most important characteristics of CUDA programs: memory man- 
agement, kernel specification and kernel invocation. Memory management is a key component in 
algorithm design using graphics cards since there is often need for transfer between CPU and GPU 
memory as standard host functions can only access CPU memory and kernels can only access GPU 
memory. The fundamental memory operations are cudaMalloc, cudaMemcpy and cudaFree, 
which are GPU analogues to the standard C functions malloc, memcpy and free. Kernel speci- 
fication requires ensuring that correct output will be given once all threads have returned. In the 
above example, it is clear that all concurrent threads will be executing the same instructions in 
parallel because there is no conditional branching, which occurs when different instructions are ex- 
ecuted in concurrent threads based on the result of a data-dependent runtime comparison. Indeed, 
while it is possible to specify arbitrary conditional branches within a kernel, this can lead to slow 
performance since threads in a SIMD architecture execute sequentially when they are not execut- 
ing the same instructions, which can be devastating to performance. An important constraint on 
kernel code that is not illustrated explicitly in the above code is that neither recursive functions 
nor function pointers can be defined in device code. This is due to the fact that kernel functions 
are completely determined at compile time with __device__ functions simply inlined, or inserted 
into the kernel, during compilation. With respect to kernel invocation, the number of threads and 
blocks assigned to each kernel can be decided at runtime in host code. This is useful since com- 
putation time can depend strongly on these numbers and optimal configurations will vary across 
graphics cards and features of the data. A final remark is that the level of abstraction provided 
by CUDA is close to the hardware operations on the device. This ensures that programmers are 
acutely aware of the benefits of writing, for example, kernels with minimal interaction between 
threads and avoiding branching. 



2.2 Blocks and Threads 



CUDA abstracts the hardware of the GPU into blocks and threads to simultaneously provide a 
relatively simple view of the architecture to developers while still allowing a low-level abstraction 
of the hardware for performance reasons. One can generally think of each thread as being computed 
on a virtual processor. The block abstraction is necessary to provide the concept of a virtual micro- 
processor. Threads within a block are capable of more interaction than threads in separate blocks, 
mainly due to the fact that all threads in a block will be executed on the same microprocessor. As 
such, they have access to very fast, dynamically allocated, on-chip memory and can perform simple 
barrier synchronization. In Section 12.11 this advanced functionality is not required by the example 
kernel. 

It is important to note that blocks and threads are still very much virtual constructs. At runtime, 
multiple blocks may be executed concurrently on the same multiprocessor. With respect to ALU 
execution, operations are performed on groups of 32 threads at a time, which allows each of the 
8 scalar processors to perform 4 identical instructions in quick succession in an ideal setting. The 
group of 32 threads that executes simultaneously is called a warp. 
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2.3 Single Precision Issues 



The current generation of GPUs is 4-8 times faster at single precision arithmetic than double 
precision. Although this ratio will decrease in the future, there will probably remain a factor 
2 difference in speed, the same as for Intel CPUs when using SSE instructions. This raises the 
question of whether single precision arithmetic is adequate for statistical applications. 

There are two particular areas in which care must be taken. The first concerns the much more 
limited range of single precision floating point numbers. Because of their 8-bit exponent, their 
magnitude must lie in the approximate range [10"^*^, 10+^^], whereas the magnitude of double pre- 
cision variables is in the approximate range [10"^''^, 10"'^^''^]. Consequently, when working in single 
precision it is often necessary to work with log-likelihoods, rather than the likelihoods themselves 
though this is rarely a restriction for statisticians. 



The second area of potential issues concerns the averaging of floating point values, for 1. 
The simplest implementation uses an accumulator, to which the values are added one at a time. 
However, this may lead to a large increase in the error due to finite machine precision. When all 
of the values are of the same sign, the relative error is amplified by factor 0{N) in the worst case, 
and 0(\/iV) in the more typical case where the rounding error at e ach step c an be mode lled as a 
random variable with zero mean. This behaviour is well understood tiighaid (Il993l . booi ) and the 
growth can be reduced to 0(log A^) by using a binary tree summation algorithm in which the values 
are summed in pairs, and then those new values are summed in pairs, and the process is repeated 
until a single value is obtained. This is the natural approach for the parallel implementation of 
a reduction operation. Example code is provided by NVIDIA on their CUDA website, and the 
implementation in our code is based on this. 



Despite these concerns, single precision seems perfectly sufficient for the applications in this paper. 
The statistical variability due to the use of random numbers within the algorithms exceeds the 
perturbations due to finite machine precision. 



2.4 GPU Parallelizable Algorithms 



In general, if a computing task is well-suited to SIMD parallelization then it will be well-suited to 
computation on a GPU. In particular, data-parallel computations with high arithmetic intensity 
(computations where where the ratio of arithmetic operations to memory operations is high) are able 
to attain maximum performance from a GPU. This is because the volume of very fast arithmetic 
instruction can hide the relatively slow memory accesses. It is crucial to determine whether a 
particular computation is data-parallel on the instruction level when determining suitability. From 
a statistical simulation perspective, integration via classical Monte Carlo or importance sampling 
are ideal computational tasks in a SIMD framework. This is because each computing node can 
produce and weight a sample in parallel, assuming that the sampling procedure and the weighting 
procedure have no conditional branches. If these methods do branch, speedup can be compromised 
by many computing nodes running idle while others finish their tasks. This can occur, for example, 
if the sampling procedure uses rejection sampling. 
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In contrast, if a computing task is not well-suited to SIMD parallelization then it will not be well- 
suited to computation on a GPU. In particular, task-parallel computations where one executes 
different instructions on the same or different data cannot utilize the shared flow control hardware 
on a GPU and often end up running sequentially. Even when a computation is data-parallel, it 
might not give large performance improvements on a GPU due to memory constraints. This can 
be due to the number of registers required by each thread (see Sections 14.21 and E]) or due to the 
size and structure of the data necessary for the computation requiring large amounts of memory to 
be transferred between the host and the graphics card. The latter issue is analogous to the issue of 
thrashing in virtual memory systems and can occur, for example, when an algorithm iterates over 
a block of data that will not fit in memory. 



There are also many computational tasks in statistical computing that are just difficult to paral- 
lelize. For example, standard Metropolis-Hastings MCMC with a single chain is difficult to paral- 
lelize in the general case because it is a naturally sequential algorithm. Paral lelization of this type of 
metho d usually involves parallelization of the target density evaluation as in lSuchard and Rambaut 
i,he sampling from or evaluatio n of the proposal density or computation of multiple possible 
execution paths as in iBrockwelll Jiooi) as opposed to parallelization of the general algorithm itself. 



The availability of new hardware suited to parallel computation motivates use of a new model of 
computation for developing and analyzing the efficacy of statistical algorithms. In some cases, 
existing algorithms will require little modification to take advantage of this technology whilst in 
others major changes will have to be made. There is also the potential for previously impractical 
and novel algorithms to become important tools for statisticians. 



2.5 Parallel Random Number Generation 



One important aspect of any Monte Carlo simulation is the generation of pseudorandom numbers. 
Fortunately, many uniform pseudorandom number generators can be implemented efficiently in 
parallel. The key idea is that each thread computes a contiguous block of numbers within a single 
overall stream. The thread can jump to the start of its block of numbe rs using a "skip-ahea d" 
algorithm which enables it to skip n places in O(logn) operations (e.g. see lL'Ecuyer et al.l ( 20021 )). 
The uniform pseudorandom numbers can then be transformed to match various different output 
distributions as needed. In our ap plications we use a parallelized version of the multiple recursive 
generator MRG32k3a pre sented in [L 'Ecuveil (I1999I I as well parallelized version of a xorshift 
random number generator iMarsaglial (|2003l ). In the case of the xorshift random number generator, 
more time must be spent to compute the seeds for each thread before any computation is done but 
the random number generation itself is faster and the initialization can be done offline. 



3 Parallelizable Sampling Methods 



In this section we consider a number of sampling methods for which parallel implementations can 
be produced without significant modification. There is an abundance of statistical problems that 
are essentially computational in nature, especially in Bayesian inference. In many such cases, the 
problem can be distilled into one of sampling from a probability distribution whose density tt we 
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can compute pointwise and up to a normalizing constant, i.e. we can compute vr*(-) where 7r(x) = 
-7r*(x)/Z. A common motivation for wanting samples from vr is so we can compute expectations of 
certain functions. If we denote such a function by (p, the expectation of interest is 

I — (?i)(x)7r(x)dx 

The Monte Carlo estimate of this quantity is given by 

1=1 

where {x^*)}^^ are samples from tt. 



Clearly, we need samples from vr in order to compute this estimate. In practice, however, we often 
cannot sample from vr directly. There are two general classes of methods for dealing with this. The 
first are importance sampling methods, where we generate weighted samples from vr by generating 
N samples according to some importance density 7 proportional to 7* and then estimating / via 



N 



i=l 



where W^^^ are normalized importance weights 

The asymptotic variance of this estimate is given by C{(j), vr, 7) /N, i.e. a constant over A^. For many 
problems, however, it is difficult to come up with an importance density 7 such that C{(j), vr, 7) is 
small enough for us to attain reasonable variance with practical values of N. 



The second general class of methods are MCMC methods, in which we construct an ergodic vr- 
stationary Markov chain sequentially. Once the chain has converged, we can use all the dependent 
samples to estimate /. The major issue with MCMC methods is that their convergence rate can 
be prohibitively slow in some applications. 



There are many ways to parallelize sampling methods that are not the focus of this work. For 
example, naive importance sampling, like classical Monte Carlo, is intrinsically parallel. Therefore, 
in applications where we have access to a good importance density 7 we can get linear speedup 
with the number of processors available. Similarly, in cases where MCMC converges rapidly we can 
parallelize the estimation of / by running separate chains on each processor. While these situations 
are hoped for, they are not particularly interesting from a parallel architecture standpoint because 
they can run equally well in a distributed system. Finally, this paper is not concerned with problems 
for which the computation of individual MCMC moves or importance weights are very expensive 
but themselves parallelizable. While the increased availability of parallel architectures will almost 
certainly be of help in such cases, the focus here is on potential speedups by parallelizing general 
sampl ing methods. An example of recent work in this area can be found in lSuchard and Rambaut 
(I2OO9I '). in which speedup is obtained by parallelizing evaluation of individual likelihoods. 
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Much work in recent years has gone into deahng with the large constants in the variance of impor- 
tance sampUng estimates and slow convergence rates in MCMC and it is in these 'advanced' Monte 
Carlo methods that we direct our interest. This is mainly because while they are parallelizable, 
they are not trivially so and stand to benefit enormously from many-core architectures. In the 
remainder of this section we briefly review three such methods: population-based MCMC, SMC 
and SMC samplers. 



3.1 Population-Based MCMC 



A common technique in facilitating sampling from a complex distribution vr with support in X is 
to introduce an auxiliary variable a £ A and sample from a higher-dimensional distribution tt with 
support in the joint space A x X, such that vf admits vr as a marginal distribution. With such 
samples, one can discard the auxiliary variables and be left with samples from vr. 

This idea is utilized in population-based MCMC, which attempts to speed up convergence of an 
MCMC chain for tt by instead constructing a Markov chain on a joint space using M — 1 
auxiliary variables each in X. In general, we have M parallel 'subchains' each with stationary 
distribution iTi,i £ M = {1, . . . ,M} and ttm = tt. Associated with each subchain i is an MCMC 
kernel that leaves vTj invariant, and which we run at every time step. Of course, without any 
further moves, the stationary distribution of the joint chain is 

AI 

Vr(xi:A/) = Jj7ri(Xj) 
i=l 

and so xa/ ~ vr. This scheme does not affect the convergence rate of the independent chain 
M. However, since we can cycle mixtu res of vf- s tation ary MCMC kernels without affecting the 



stationary distribution of the joint chain iTierneyl (119941). we can a l low certain types of interactio n 



between the subchains which can speed up convergence (lGeveilll99ll : iHukushima and Nemotolll996l ') . 
In general, we apply a series of kernels that act on subsets of the variables. For the sake of clarity, 
let us denote the number of second-stage kernels by R and the kernels themselves as Ki, . . . ,K{i, 
where kernel Kj operates on variables with indices in 2j G A4. The idea is that the R kernels are 
executed sequentially and it is required that each Kj leave HieXj '^^ invariant. 

Given vr, there are a wide variety of possible choices for M, tti-m-i, L^-m, R.I^-.r and , r which 



will affect the convergence rate of the joint chain. For those interested, iJasra et al.l (|2007l ) gives 
a review of some of these. It is clear that the first stage of moves involving Li-m is trivially 
parallelizable. However, the second stage is sequential in nature. For a parallel implementation, it 
is beneficial for the Tj's to be disjoint as this allows the sequence of exchange kernels to be run in 
parallel. Of course, this implies that Ii;R should vary with time since otherwise there will be no 
interaction between the disjoint subsets of chains. Furthermore, if the parallel architecture used 
is SIMD (Single Instruction Multiple Data) in nature, it is desirable to have the Kj^s be nearly 
identical algorithmically. The last consideration for parallelization is that while speedup is generally 
larger when more computational threads can be run in parallel, it is not always helpful to increase 
M arbitrarily as this can affect the convergence rate of the chain. However, in situations where a 
suitable choice of M is dwarfed by the number of computational threads available, one can always 
increase the number of chains with target vr to produce more samples. 
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3.2 Sequential Monte Carlo 



SMC methods are a powerful extension of importance sampling methodology that are particularly 
popular for sampling from a sequence of probability distributio ns. In the context of state- space 
models, th ese methods are known as particle filtering methods; Doucet and Johansen ( 20081 ) and 



Liu ( 20081 ) give recent surveys of the field. In this context, let {xt}^>Q be an unobserved Markov 



process of initial density xq ~ Po{') and transition density x^ ~ /(-Ixt-i) for t > 1. We only 
have access to an observation process {yt}t>o: the observations are conditionally independent 
conditional upon {x^jj^Q of marginal density ~ g{yt\xt) for t > 1. SMC methods are used 
to approximate recursively in time the filtering densities p(xo:t|yo:t) which are proportional to 
p{^0:t,yo:t) = Po(xo) nLi n*=i 5(yj|xi) for t = 1,...,T. These distributions are ap- 
proximated with a set of random samples called particles through use of a sequential version of 
importance sampling and a special particle-interaction step known as resampling. 

Parallelization of SMC methods is reasonably straightforward. The importance sampling step used 
at each time step is trivially parallelizable as it involves only the local state of a particle. The 
resampling step, in which some particles are replicated and others destroyed depending on their 
normalized importance weights, comprises the construction of an empirical cumulative distribution 
function for the particles based on their importance weights followed by sampling from this 
times, where is the fixed number of particles used throughout the computation. While neither 
of these tasks is trivially parallelizable, they can benefit moderately from parallelization. However, 
the bulk of the speedup will generally come from the par a.llelization of the evo lution and weighting 
steps. Therefore, using criteria like effective sample size (|Lhi and Chenlll99,^ ^ to avoid resampling 
at every time step can also improve speedup. 



3.3 Sequential Monte Carlo Samplers 



SMC samplers (jPel Moral et al.l hood ) are a more general class of methods that utilize a se- 



q uence of aux i liary distributions ttq, . . . ,vrr, much like population-based MCMC as discussed out 



m 



Jasra et al.l (120071 ). However, in contrast to population-based MCMC, SMC samplers start from 
an auxiliary distribution vro and recursively approximate each intermediate distribution in turn 
until finally vr^^ = vr is approximated. The algorithm has the same general structure as classical 
SMC, with differences only in the types of proposal distributions, target distributions and weighting 
functions used in the algorithm. As such, parallelization of SMC samplers closely follows that of 
SMC. 

The difference between population-based MCMC and SMC samplers is subtle but practically im- 
portant. Both can be viewed as population-based methods on a similarly defined joint space since 
many samples are generated at each time step in parallel. However, in population-based MCMC the 
samples generated at each time each have different stationary distributions and the samples from a 
particular chain over time provide an empirical approximation of that chain's target distribution. 
In SMC samplers, the weighted samples generated at each time approximate one auxiliary target 
distribution and the true target distribution is approximated at the last time step. This difference 
is further discussed in Section [4.1.31 
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4 Canonical Examples 



To demonstrate the types of speed increase one can attain by utilizing GPUs, we apply each method 
to a representative statistical problem. We use Bayesian inference for a Gaussian mixture model as 
an application of the population-based MCMC and SMC samplers, while we use a factor stochastic 
volatility state-space model to gauge the speedup of our parallel SMC method. We ran our parallel 
code on a computer equipped with an NVIDIA 8800 GT GPU, a computer equipped with an 
NVIDIA GTX 280 GPU and we ran reference single-threaded code on a Xeon E5420 / 2.5 GHz 
processor. The resulting processing times and speedups are given in Tables [1] - [3j 

The applications we discuss here are representative of the types of problems that these methods are 
commonly used to solve. In particular, while the distribution of mixture means given observations 
is only one example of a multimodal distribution, it can be thought of as a canonical distribution 
with multiple well-separated modes. Therefore, the ability to sample points from this distribution is 
indicative of the ability to sample points from a wide range of multimodal distributions. Similarly, 
performance of a latent variable sampler in dealing with observations from a factor stochastic 
volatility model is indicative of performance on observations from reasonably well-behaved but 
non-linear and non-Gaussian continuous state-space models. 



4.1 Mixture Modelling 



Finite mixture models are a v ery popular class of statist ical models as they provide a flexible way 
to model heterogeneous data ( McLachlan and Peell 200ol ) . Let y = yi;m denote i.i.d. observations 



where yj £ M for j G {1, • ■ • ,m}. A univariate Gaussian mixture model with k components states 
that each observation is distributed according to the mixture density 

k 

P{yj\^J'l:k,Cri:k,Wl:k-l) = Wif {Vjl^J-i, (^i) 

i=l 

where / denotes the density of the univariate normal distribution. The density of y is then equal 

to llf=lP{yj\lJ'l:k,(^hk,Wi:k-l)- 

For simplicity, we assume that k, wi-k-i and di-k are known and that the prior distribution on /i 
is uniform on the /c-dimensional hypercube [—10, lO]'^. We set k = A, Ui = a = 0.55, Wi = w = 1/k 
for i £ {1, . . . , A:}. We simulate m = 100 observations for /x = fii-^ = (—3,0,3,6). The resulting 
posterior distribution for /x is given by 

p(Hy)ocp(y|/^)I(/xG [-10,10]^) 

The main computational challenge associated to Bayesian inference in finite mixture models is the 
nonidentifiability of the components. As we have used exchangeable priors for the parameters Hi-a, 
the posterior distribution p(/x|y) is invariant to permutations in the labelling of the parameters. 
Hence this posterior admits fc! = 24 symmetric modes. 

Generating N samples from such a posterior is a popular method for determining the ability of 
samplers to explore a high-dimensional space with multiple well-separated modes, which should all 
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be represented in the samples. Basic random-walk MCMC and importance sampling methods typi- 



cally fail to provide a correct approximation of the posterior for practical values of N (jCeleux et al, 



2000). It should be noted that while it might not be necessary to sample from all the symmet 



ric modes in the case of a mixture model, the successful traversal of all the modes suggests that 
the sampler would succeed in traversing non-symmetric modes in other distributions, so long as 
symmetry is not exploited by the sampler. 



4.1.1 Population-Based MCMC 



We select the auxiliary distributions tti-hj^i following the parallel tempering methodology, i.e. 
7rj(x) tx 7r(x)^' with < /?i < • • • < Pm = 1 and use M = 200. This class of auxiliary distributions 
is motivated by the fact that MCMC converges more rapidly when the target distribution is flatter. 
For this problem, we use the cooling schedule f3i = (i/M)'^ and a standard ^[{0,1^) random walk 
Metropolis-Hastings kernel for the first stage moves. 



For t he second stage moves, we use only the basic exchange move ( Gever 1991 : Hukushima and Nemotol 



199fil ): chains i and j swap their values with probability min{l,ajj} where 

7ri(xj)vrj(xi) 



a,: 



7ri(xj)7rj(xj 



Further, we allow exchanges to take place only between adjacent chains so that all moves can 
be done in parallel. We use R = M/2 and Ii-.r is either {{1, 2}, {3, 4}, . . . ,{M - 1,M}} or 
{{2, 3}, {4, 5}, . . . , {M-2, M-1}, {M, 1}}, each with probability half. While use of permutation or 
crossover moves would be appropriate for this particular model, we felt that they would detract from 
the ability to generalize our results to the case where the likelihood is not invariant to permutations 
of the labels. 



Table 1: Running times for the Population-Based MCMC Sampler for various numbers of chains 
M. 



M 


CPU (mins) 


8800GT (sees) 


Speedup 


GTX280 (sees) 


Speedup 


8 


0.0166 


0.887 


1.1 


1.083 


0.9 


32 


0.0656 


0.904 


4 


1.098 


4 


128 


0.262 


0.923 


17 


1.100 


14 


512 


1.04 


1.041 


60 


1.235 


51 


2048 


4.16 


1.485 


168 


1.427 


175 


8192 


16.64 


4.325 


230 


2.323 


430 


32768 


66.7 


14.957 


268 


7.729 


527 


131072 


270.3 


58.226 


279 


28.349 


572 



To test the computational time required by our algorithms we allow the number of chains to vary 
but fix the number of points we wish to sample from the marginal density ttm = vt at 8192. As 
such, an increase in the number of chains leads to a proportional increase in the total number of 
points sampled. Processing times for our code are given in Table [H in which one can see that 
using 131072 chains is impractical on the CPU but entirely reasonable using the GPU. Figure [5] 
shows the estimated posterior density p(Aii:2|y) from a set of 2^*^ MCMC samples from ttm with 
M = 32768, which is nearly identical to the estimated marginal posterior densities of any other pair 
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of components of /x. This marginal density has 12 weh-separated modes in M? but it is worth noting 
that the joint density j'(/ii:4|y) has 24 well-separated modes in M'*. Figure [6] shows the number of 
points from each mode for various values of M. We also computed the average number of iterations 
taken for the samplers to traverse all modes for the different values of M. For M = 1 and M = 2, 
the sampler did not traverse all the modes at all, while for values of M between 4 and 32 the 
traversal time decreased from 80000 to 10000, after which it was unchanged with increases in M. 
These numbers should be compared to 24 x 1/24 ~ 91 - the expected number of samples required to 
cover every mode if one could sample independently from vr - where Hi is the ith harmonic number. 




Figure 5: Estimated marginal posterior density p(/Ui:2|y) from MCMC samples 



4.1.2 SMC Sampler 



As with population-based MCMC, we use a tempering approach and the same cooling schedule, 
i.e. '7rj(x) oc 7r(x)^* with (3t = (t/M)"^ and M = 200. we use the uniform prior on the hyper- 
cube to generate the samples {xq^'^'*} and perform 10 MCMC steps with the standard J\f{0,lk) 
random walk M etropolis - Hasti n gs kernel at ever y time step. We use th e generic backwards ker- 
nel suggested in ICrooksl (Il998l '). iNeall (I2OOII ) and lOel Moral et all tood ) for the case where each 
kernel is vrt-stationary so that the unnormalized incremental importance weights are of the form 
7r4(xt_i)/7rt_i(xi_i). 



Table 2: Running times for the Sequential Monte Carlo Sampler for various values of A^. 



N 


CPU (mins) 


8800GT (sees) 


Speedup 


GTX280 (sees) 


Speedup 


8192 


4.44 


1.192 


223.5 


0.597 


446 


16384 


8.82 


2.127 


249 


1.114 


475 


32768 


17.7 


3.995 


266 


2.114 


502 


65536 


35.3 


7.889 


268 


4.270 


496 


131072 


70.6 


15.671 


270 


8.075 


525 


262144 


141 


31.218 


271 


16.219 


522 



15 




Figure 6: Number of MCMC samples from each mode 



We also ran the SMC sampler with no resampling at all, whi ch for these settings corresponds to 
the Annealed importance sampling (AIS) method proposed in Neall ( 2001 ): see also Crooks ( 19981 ) 
for a similar method in physics. The resulting samples were less successful at characterizing the 
full multi-modality of the poster ior distribution. This i s consistent with the numerical findings and 
theoretical results discussed in ()Del Moral et al.ll2006l . Sections 3.5 and 4.2.3): when the MCMC 
kernels used at every time step mixes reasonably well, it is very beneficial to resample. In addition, 
when ESS is used as a threshold for resampling in the SMC sampler, the resampling step takes very 
little time compared to the evolution and weighting of the particles simply because it happens so 
infrequently. As such, the running time of the SMC sampler compared to AIS is practically identical. 
Processing times for our code are given in Table [21 Figure [7] shows the estimated posterior density 
p(/^i:2|y) from the SMC sampler with N = 65536. Figure [8] shows the number of points from each 
mode for various values of N. 



4.1.3 Comparison 



While both methods are capable of exploring the posterior distribution for /x, there are important 
differences in how the methods make use of parallelization. In particular, the SMC sampler par- 
allelizes across particles approximating the same auxiliary distribution whilst the MCMC sampler 
parallelizes across auxiliary distributions at the same iteration. As such, to make full use of the 
graphics card the SMC sampler requires many particles while the MCMC sampler requires many 
auxiliary distributions. In most cases, however, one will be happy to use in excess of 8192 particles 
for SMC but one may not want to use in excess of 32768 auxiliary distributions. Indeed, for the 
application described above there seems to be no benefit in increasing the number of chains beyond 
128, although this might also be due to the choice of cooling schedule and random walk variances. 
Furthermore, we utilized only the simplest information exchange and proposal moves in our sam- 
plers so as not to trivialize the problem. It should be noted, however, that there are situations in 
which a large number of intermediate temperatures are required for exchange acceptance probabil- 
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Figure 7: Estimated marginal posterior density p{fJ-i:2\y) from SMC samples 




N=1024 N=2048 N=4096 N=8192 
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Figure 8: Effective number of SMC samples from each mode 
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ities to be greate r than some preset va lue, for example when the dimension of the distribution of 
interest increases IPredescu et"al] ^2()0i ]. As mentioned in Section r4.1.1l we would like to emphasize 



that the use of very large numbers of chains is possible using these parallel methods with only a 
modest increase in computation time. 

The SMC sampler appears to be more efficient than the MCMC sampler for this problem. Indeed, 
with only 8192 particles the SMC sampler gives a reasonable representation of the posterior, taking 
only 597ms. The MCMC sampler requires around 2^^* samples to give a reasonably uniform number 
of samples per mode, and this takes just over 2 minutes. In addition, although we have not done so 
here, it is possible with both the SMC and MCMC approaches to use the samples from the auxiliary 
distributions to estimate integrals of interest by computing appropriate importance weights. An 
in teresting recent p ropos al on how to effectively combine estimates using such samples can be found 



m 



Gramacv etHl ^200^ ). 



For Bayesian inference in r nixtur e mod els, there are many ways of dealing with the identifiability 
of the mixture parameters; Jasra ( 20051 ) includes a review of these. It is worth mentioning that for 



this type of model, we can permute samples as a post-processing step or within an MCMC kernel 
so traversal of the modes can be achieved trivially. The speedup of both methods is unaffected by 
the use of such mechanisms. In addition, the speedup is unaffected by increases in the number of 
observations since this only increases the amount of computation that each thread must do by a 
constant factor. Increasing the number of observations also has little effect on the difficulty for the 
sampler to move between modes since the modes are already well separated. Similarly, the speedup 
observed is robust to changes in the number of mixture components. The computation of each 
likelihood requires memory that is linear in the number of components, while the memory required 
per thread dictates the number of threads that can be run in parallel. However, as the number of 
components increases we usually have more observations, providing an opportunity to parallelize 
the computation of a single likelihood across multiple threads. This allows the amount of memory 
per thread to be significantly lower than if the complete likelihood was calculated by each thread 
and thus still allows many threads to be run in parallel. 



4.2 Factor Stochastic Volatility 



Many financial time series exhibit changing variance. A simple multivariate volatility model that 
allows us to capture the changing cross-covariance patterns of time series consists of using a dy- 
namic latent factor model. In such models, all the variances and covariances a re modelled through 
a low dimensional stoch astic volatility structure driven by common factors (iLiu and WestJ l200nl : 
Pitt and Shephardlll999l). We consid er here a factor stochastic volatility model most similar to 
that proposed in iLiu and We^ (|200nl ): 



yt ~ Af{Bft, *) 
ft ~AA(0,Ht) 
xt ~7^(*xt_i,U) 
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where 



* = diag(V'i,. . . ,V'm) 
Hi = diag(exp(xt)) 

* = diag((/)i, ...,(I)k) 



Here, ft is X-dimensional, is A/-dimensional and B is an M x X factor loading matrix with zero 
entries above the diagonal for reasons of identifiability. The latent variable at each time step t is 
the i^T-dimensional vector x^. The likelihood of the data, y^, given x^ is Gaussian with 

yt|xi ~AA(0,BHiB^ + *) 



We generate data for times t 
0i =0.9,i G {1,...,K}, 



B 



1, . . . , T = 200, M = 5, K = 3, yio = 0,ipi = 0.5, i G {1, . . . , M}, 



/ 1 
0.5 

0.5 

0.2 

V 0.8 



\ 

1 
0.5 1 
0.6 0.3 
0.7 0.5 / 



and U 




This is a simple example of a multivariate, non-linear and non-Gaussian continuous state-space 
model for which particle filters are commonly employed to sample from the posterior p{xQ-T\yi:T)- 
Processing times for our code are given in Table [3l In Figure [9] we plot the filter means for each 
component of x with ±1 sample standard deviations alongside the true values of x used to generate 
the observations. 

Table 3: Running time (in seconds) for the Sequential Monte Carlo method for various values of 
N. 



N 


CPU 


8800GT 


Speedup 


GTX280 


Speedup 


8192 


2.167 


0.263 


8 


0.082 


26 


16384 


4.325 


0.493 


9 


0.144 


30 


32768 


8.543 


0.921 


9 


0.249 


34 


65536 


17.425 


1.775 


10 


0.465 


37 


131072 


34.8 


3.486 


10 


.929 


37 



The speedups obtained in this application are considerably less than for mixture model inference 
problem. This can be explained by lower arithmetic intensity, higher space complexity in each 
thread and increased resampling rate as compared to the SMC sampler example above. The 
mixture model likelihood calculation contains a compute-intensive product-sum operation involving 
104 values whilst the factor stochastic volatility likelihood consists mainly of matrix operations. 
In the latter case, the speedup is independent of T but not the dimension of the observations 
since the amount of memory required per thread increases quadratically in the dimension of each 
observation. For example, we attained a speedup of 80 on the GTX 280 when running a particle 
filter for a multivariate stochastic volatility model with M = K = 2. The frequency of resampling 
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is an issue with respect to speedup because it can typically only attain around 10 to 20 fold speedup 
for practical values of N, mainly due to the parallel scan operation. This potentially gives rise to 
tradeoffs in speedup between the transition and weighting steps and the time between resampling 
steps for some models, since more sophisticated proposal distributions that parallelize less cleanly 
might reduce the resampling rate. This type of performance, however, still provides considerable 
speedup and may be more representative of the type of speedup practitioners can expect in general. 

4.3 Floating Point Precision 

For all three algorithms discussed above, we ran identical algorithms with the same random num- 
bers on the CPU using double precision floating point numbers and the resulting estimates of 
expectations of interest were affected by an order of magnitude less than the Monte Carlo variance 
of the estimates. The actual paths sampled, of course, were different due to the sensitivity of all of 
the algorithms to perturbations but this did not affect the ability of the samples to approximate 
the target distribution. 



5 Discussion 

The speedup for the population-based MCMC algorithm and the SMC sampler is tremendous. 
In particular, the evaluation of p(y|/^) for the mixture-modelling application has high arithmetic 
intensity since it consists of a product-sum operation with 400 Gaussian log-likelihood evaluations 
involving only 104 values. In fact, because of the low register and memory requirements, so many 
threads can be run concurrently that SIMD calculation of this likelihood can be sped up by 500 
times on the 8800 GT and 800 times on the GTX 280. However, the speedup attained for the 
standard SMC algorithm may be more representative of the kinds of gains one can expect in 
most applications with only reasonable arithmetic intensity. Even so, speedups of 10 to 35 make 
many problems tractable that previously were not by reducing a week's worth of computation 
to a few hours. For example, estimation of static parameters in continuous state-space models 
or the use of SMC proposals within MCMC can require thousands of runs, so a s peedup of this 
scale can substantially reduce the computation time of such approaches (see e.g. lAndrieu et al. 



(120091 )1. ^t is worth noting also that we can expect speedups in the vicinity of 500 with SMC if few 



resampling steps are required and each weighting step has small space complexity and moderate 
time complexity. 

The GTX 280 GPU outperforms the 8800 GT GPU by a factor of about 2 in all situations in which 
the GPU is used to capacity. This is the case in all but the population-based MCMC algorithm, in 
which the number of threads is determined by the number of auxiliary distributions. The reason for 
this is simple: the algorithms presented are register-bound on the inputs given, in that the number 
of registers required by each thread is the critical quantity that bounds the number of threads 
that can be run concurrently. The GTX 280 has twice the number of registers per multiprocessor 
and more than twice the multiprocessors compared to the 8800 GT. Hence, one could expect more 
speedup on many-core chips with even more registers. In fact, further improvements could be made 
using multiple cards with large amounts of memory, configurations of which are now available in 
NVIDIA's Tesla line. These Tesla 'personal supercomputers' comprise 3 or more high-performance 
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GPUs, each with 4GB of memory and a CPU with at least as much memory as the CPUs' combined. 
It is also possible to design algorithms that are memory-bound, though we have not encountered 
this in the context of Monte Carlo simulation. It is certainly possible that both register and 
memory limitations can affect the parallelizability of the mentioned algorithms when facing very 
high-dimensional problems. However, in such cases it is possible that alternative uses of many-core 
architecture can provide speedup in these situations. 

The acceleration of the Monte Carlo methods discussed here have practical benefits not only to 
computation time but also to energy efficiency. A general purpose CPU allocates extra circuits 
and hence power to flow control and caching which is unnecessary for the types of computation 
described here. As such, reasonable decreases in power consumption can be realized by using 
specialized many-core architectures like SIMD instead. 

It should be noted that while we have used CUDA to implement the parallel components of al- 
gorithms, the results are not necessarily specific to this framework or to CPUs. It is expected 
that the many-core processor market will grow and there will be a variety of different devices and 
architectures to take advantage of. However, the SIMD parallelization technique and the sacrifice 
of caching and flow control for arithmetic processing is unlikely to disappear, particularly because 
when it is well-suited to a problem it will nearly always deliver considerable speedup. In addition, 
CPUs are affordable, off-the-shelf components that can be easily installed on a personal computer. 
Of particular interest is an emerging framework, the Open Computing Language (OpenCL), which 
provides a uniform programming environment for developers that enables them to write portable 
code for a variety of parallel devices, including GPUs and CPUs. For users who would like to see 
moderate speedup with very little effort, there is work being done to develop libraries that will take 
existing code and automatically generate code that will run on a CPU. An example of this is the 
Jacket engine for MATLAB code, created by Accelereyes. 

The speedups attainable with many-core architectures have broad implications in the design, anal- 
ysis and application of SMC and population-based MCMC methods. With respect to SMC, it 
allows more particles to be used for the same or even less computation time, which can make these 
samplers viable where they previously were not. When faced with designing a population-based 
MCMC sampler, the results expectedly show that there is little cost associated with increasing the 
number of auxiliary distributions until the GPU reaches the critical limit of threads it can run 
concurrently. After this, there is a doubling in the computation time when the number of chains is 
doubled. In our application, this does not occur until we have around 4096 auxiliary distributions. 
One might notice that this number is far larger than the number of processors on the GPU. This 
is due to the fact that even with many processors, significant speedup can be attained by having a 
full pipeline of instructions on each processor to hide the relatively slow memory reads and writes. 
Of course, we can expect this application-specific number to decrease when dealing with higher- 
dimensional distributions or those whose density evaluations require more registers or memory. 
Nevertheless, practitioners have more freedom to increase the number of auxiliary distributions to 
achieve a faster rate of convergence as the computation time associated with each step is not as 
closely tied to this value as it is on a single-core processor. In both SMC and MCMC, it is also clear 
from this case-study that it is beneficial for each thread to use as few registers as possible since 
this determines the number of threads that can be run simultaneously. This may be of interest to 
the methodology community since it creates a space-time tradeoff that might be exploited in some 
applications. 
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A consequence of the space-time tradeoff mentioned above is that methods which require large 
numbers of registers per thread are not necessarily suitable for parallelization using GPUs. For 
example, operations on large, dense matrices that are unique to each thread can restrict the number 
of threads that can run in parallel and hence dramatically affect potential speedup. In cases where 
data is shared across threads, however, this is not an issue. For example, a mixture model with 
large amounts of data does not affect the number of registers required whilst increasing the number 
of components increases the number of registers required only linearly. In contrast, increasing the 
number of observed assets in a factor stochastic volatility model leads to a quadratic increase in the 
number of registers required, substantially affecting scalability in this regard. An increase in the 
number of observations itself has no effect on speedup. In principle, it is not the size of the data 
that matters but the space complexity of the algorithm in each thread that dictates how scalable 
the parallelization is. 

The parallelization of the advanced Monte Carlo methods described here opens up challenges for 
both practitioners and for algorithm designers. There are already an abundance of statistical 
problems that are being solved computationally and technological advances, if taken advantage of 
by the community, can serve to make previously impractical solutions eminently reasonable and 
motivate the development of new methods. 
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Appendix: Web Resource 

We have created a website resource at 'http : / / www . o xford-man . ox . ac . uk/gpuss/' for the 
statistics community with the code used in these examples as well as useful information on GPU 
programming for statistical computing with links to tutorials and relevant papers. 
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